Advanced Practice in ML

Handling data labels

Label multiplicity

To obtain enough labeled data, companies often rely on several data sources and annotators. This leads to the problem of label ambiguity: what to do when a single data instance has multiple conflicting labels.
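One common way to settle conflicting labels is a majority vote across annotators. The notes do not prescribe a method, so the sketch below is only an illustration: `resolve_label`, the `min_agreement` threshold, and the sample data are all hypothetical.

```python
from collections import Counter

def resolve_label(labels, min_agreement=0.5):
    """Return the majority label, or None if no label exceeds the threshold.

    Hypothetical helper: majority voting is an assumed strategy here,
    not one taken from the notes.
    """
    if not labels:
        return None
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Require strictly more than `min_agreement` of annotators to agree.
    if top_count / len(labels) > min_agreement:
        return top_label
    return None  # unresolved: needs another annotator or an expert review

# Made-up annotations from several annotators per sample.
annotations = {
    "sample_1": ["spam", "spam", "ham"],
    "sample_2": ["spam", "ham"],
}
resolved = {sid: resolve_label(lbls) for sid, lbls in annotations.items()}
```

Here `sample_1` resolves to a single label, while `sample_2` stays unresolved because the annotators are split evenly.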

Data lineage

= a technique for keeping track of the origin of each data sample as well as of its labels
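A minimal sketch of what tracking lineage can look like in practice: each sample carries the origin of its data and of its label, so that a suspicious source can be isolated later. The record layout, field names, and example values are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledSample:
    # Hypothetical record layout; the field names are illustrative only.
    sample_id: str
    text: str
    label: str
    data_source: str   # where the raw sample came from
    label_source: str  # who or what produced the label

samples = [
    LabeledSample("s1", "win a prize now", "spam", "vendor_a", "annotator_1"),
    LabeledSample("s2", "meeting at noon", "ham", "internal_logs", "annotator_2"),
    LabeledSample("s3", "see you tomorrow", "spam", "vendor_a", "annotator_1"),
]

def samples_labeled_by(samples, label_source):
    """Return every sample whose label came from the given source."""
    return [s for s in samples if s.label_source == label_source]

# If one annotator turns out to be unreliable, their labels can be
# pulled out for review instead of discarding the whole dataset.
to_review = samples_labeled_by(samples, "annotator_1")
```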

Handling imbalanced datasets

An imbalanced dataset makes standard classification metrics, such as accuracy, unreliable. There are some solutions:
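To make the metric problem concrete, consider a made-up dataset with 99 negative samples and 1 positive sample, and a trivial classifier that always predicts the negative class. The small sketch below (hypothetical helper functions and toy data) compares accuracy with recall on the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Toy data: 99 negatives, 1 positive (illustrative, not real results).
y_true = [0] * 99 + [1]
# A classifier that ignores its input and always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)  # 0.99
rec = recall(y_true, y_pred)    # 0.0
```

The classifier scores 99% accuracy while never detecting the one positive sample, which is why accuracy alone says little on imbalanced data.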

Feature generalization

Always consider two aspects with regard to generalization:

Model selection

Combining models

We can get an additional performance gain by combining strong models (base models) made with different learning algorithms.
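One simple way to combine base models is majority voting over their predictions. The sketch below is an assumption-laden illustration: the three "models" are hand-written toy rules standing in for models trained with different learning algorithms, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(models, x):
    """Predict with every base model and return the most common answer."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-ins for base models; in practice each would be a trained
# model built with a different learning algorithm.
def keyword_model(text):
    return "spam" if "prize" in text else "ham"

def length_model(text):
    return "spam" if len(text) > 40 else "ham"

def punctuation_model(text):
    return "spam" if text.count("!") >= 2 else "ham"

base_models = [keyword_model, length_model, punctuation_model]

# Two of the three base models vote "spam", so the ensemble does too.
prediction = majority_vote(base_models, "Win a prize!! Click now")
```

An odd number of base models avoids ties in a two-class problem; with an even number, a tie-breaking rule would be needed.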